Sequence Assembly Practical
Objectives:
To learn the basics of how to use Canu assembler.
Introduction
In this practical, we will learn the intricacies of assembling a ~5 Mb haploid genome using long read PacBio sequencing technology.
Familiarize with PacBio sequencing
See the video from time 1:26 to 2:36.
Pacific Biosciences Glossary of Terms
Now that you know the sequencing part, the next thing to do is to learn the genome assembly concepts. We will use the Canu assembler version 2.0 to perform the genome assembly. More information on Canu here: https://github.com/marbl/canuMore info on Canu publication here: https://genome.cshlp.org/content/27/5/722.long
There are three important stages when assembling a genome with Canu.
- 1. Correction – This step is to increase the accuracy of bases in all reads.
- 2. Trimming – This step is to trim reads to the regions that consist of high-quality bases and discard suspicious sequences such as the PacBio adapter (i.e. SMRTbell).
- 3. Assembly – Reads are ordered based on their overlaps to create contigs and produce final consensus sequences. In addition, this assembler can also output the assembly graphs in GFA format.
Methods
Copy an example Escherichia coli K12 PacBio fastq to a folder that you will name assembly. This is a subset of reads from this source: https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly
ls
pwd
mkdir assembly
cd assembly/
cp
Perform contig assembly
canu -version
canu -p ecoli -d ecoli-pacbio genomeSize=4.8m corThreads=2 -pacbio-raw pacbio.fastq
To understand what the options mean, please read https://canu.readthedocs.io/en/latest/parameter-reference.html. Write your answer next to the options.
The essential options for Canu are:
-p -d genomeSize corThreads -pacbio-raw
Show answer
-p assembly-prefix -d assembly-directory genomeSize expected genome size is 4.8 megabases corThreads number of threads for the correction step -pacbio-raw raw PB data
The program will run for quite a while depending on your computer and the number of cores you decided to use. We will go through some exercises to cement your knowledge on assembly.
If you have used older Canu version e.g. v1.7 and failed at the precompute stage due to Java Virtual Machine -d64 error, you can fix it by changing mhap.sh and precompute.sh. Or get a newer Canu version e.g. v2.0. Show more…
Errors
Fixing error demonstrates what a typical day in the life of a bioinformatician may be like. It will also show a feature of Canu that allows incomplete assembly steps to be resumed. First, we need to understand the problem. In case you encounter errors, you can find the last command when Canu tried to run something but failed. See:
./precompute.sh 2 > ./precompute.000002.out 2>&1
Find the script, precompute.sh, and the corresponding *.out.
Example of an error output:
Running job 2 based on command line options.
Dumping reads from 9001 to 12528 (inclusive).
Starting mhap precompute.
Unrecognized option: -d64
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Mhap failed
That is our problem right there. It is good to Google for ‘JVM -d64 problem’ and stackoverflow sometimes have the solutions when other users encountered similar problems, see stackoverflow.com. It is also a good practice to check on the official Canu website to see if the problem has been asked and resolved before.
grep "d64" *.sh
Output:
mhap.sh:arch=`uname -m | sed s/x86_64/amd64/`
mhap.sh: /home/student/miniconda3/envs/denovo/bin/java -d64 -server -Xmx6144m
\
precompute.sh:arch=`uname -m | sed s/x86_64/amd64/`
precompute.sh:/home/student/miniconda3/envs/denovo/bin/java -d64 -server -
Xmx6144m \
Fix the script by removing the‘-d64’ option.
sed 's/-d64//' precompute.sh > tmp; mv tmp precompute.sh
Resume assembly by rerunning the assembly command.
canu -p ecoli -d ecoli-pacbio genomeSize=4.8m corThreads=2 -pacbio-raw pacbio.fastq
Exercises
1. How many reads do we need to sequence to achieve 25X coverage of a genome with size of 4.8 Mb, and each read length is 1 kb? Show answer: 4.8e6 * 25 / 1e3 = 120000 reads
2. Do you think PacBio generates the same read length each time? Why yes or why not? Show answer: No. The polymerase in each ZMW fails at different time.
3. Below you will find 7 reads. Try to piece them together to rebuild a reference sequence. In your process to piece the reads together, what problems did you encounter? Try it for 15 minutes before reading section 3.1 – 3.6 in below link for a more comprehensive answer on assembly problems. The Principles of WGS Sequencing and Automated Fragment Assembly
GACACTCGTG
CGTGCACTT
CACTTGAGTA
GAGTAGGCTA
ACTCGTGCAC
TGCACTTGA
ACTTGAGTAGGC
Show the answer to question 3:
GACACTCGTGCACTTGAGTAGGCTAExamine the Canu output
4. When Canu has finished with ‘Computing correction layouts’, it will output some summary statistics of the reads. Verify the coverage value from below table. What are the advantages of PacBio sequencing over Illumina sequencing when it comes to genome assembly? To refresh your knowledge on Illumina sequencing, please see:
Category | Original raw reads w/overlaps | Original raw reads w/o/overlaps* |
---|---|---|
Number of Reads | 12,348 | 180 |
Number of Bases | 114,631,263 | 218,769 |
Coverage | 23.882 | 0.046 |
Median | 7,651 | 0 |
Mean | 9,283 | 1,215 |
N50 | 14,222 | 12,135 |
Minimum | 1,000 | 0 |
Maximum | 42,279 | 17,689 |
114,631,263/4.8e6 = 23.882 PB has much longer reads than can span most repeats and has random sequencing error. Check out the mean of 9kb here vs 100 or 150 PE that is typical from Illumina!
Assembly of a diploid genome
E. coli is a haploid genome and its genome size of ~5 Mb is considered small nowadays. It i not a very challenging task to assemble a genome such as E. coli. What is still challenging but is increasingly easier to do is the assembly of larger diploid genomes e.g. cattle is diploid and its genome size is ~2.7 Gb. Can you think of a reason why the assembly of diploid genome is difficult?
Canu is the first assembler that can completely phase a diploid genome. See section ‘Trio Binning Assembly’ in the Quick Start (https://canu.readthedocs.io/en/latest/quick-start.html) to see an example of how to separate a synthetic mix of two E. coli strains (i.e. creating ascenario that is similar to assembling a diploid genome). If you want to try this assembly, please install Canu version 2.0 (https://github.com/marbl/canu/releases) as older versions of the assembler were not equipped with the ability of phase genomes.
For information on a ‘real’ genome project and the first phasing of cattle genomes, please read below article. Can you identify what is the key difference that makes phasing works in the Canu pipeline? In other words, try to identify its key innovation. Hint: study figure 1 of the paper in detail. doi: 10.1101/720797v3.full
You may also be interested in
Tutorial: Genome assembly Quality Metrics
Pair-wise sequence alignment methods
Construction of substitution matrices
Multiple sequence alignment (MSA) tools and resources
Single nucleotide polymorphism (SNP) tools and resources
Back to Tutorials Main Page